FILENAME: FREEGRAM5.TXT
AUTHOR: Jim Roskind
        Independent Consultant
        516 Latania Palm Drive
        Indialantic FL 32903
        (407)729-4348
        jar@hq.ileaf.com
        or ...uunet!leafusa!jar

                                                        7/4/91

Dear C++ and C Grammar User,

I have written a YACC debugging tool, and a set of grammars for C  and 
C++ in order to use them within my own personal project development. I 
have  made  the  results  of  my  work in this area available to other 
developers at no charge with the hope that they would use my work.   I 
believe   the   entire   C++   community   can   benefit   from   such 
standardization.  If any of the  copyright  notices  on  the  grammars 
(which  are  VERY  liberal) prevent using my work, please notify me of 
the problem.

Note that the grammars can each be processed by  YACC,  but  they  are 
very clean, and make NO USE of the precedence setting (i.e.: %prec) or 
associativity  setting (i.e.:%assoc) constructs of YACC.  This feature 
should make them easily  portable  to  other  parser  generator  input 
format.   This "cleanliness" fact also provides brutal exposure of all 
the complex constructs in C++, and the complexity of the grammar as  a 
whole  (the  C++  grammar  is  2 to 3 times as large as the C grammar) 
reflects this exposure.

The files included in this set are:

    FREEGRM5.TXT    This introductory file
    GRAMMAR5.TXT    Parsing ambiguities in C++, and in my grammar
    CPP5.Y          My YACC compatible C++ grammar
    C5.Y            My YACC compatible, ANSI C conformant grammar
    CPP5.L          Flex input file defining a C++ lexical analyzer
    SKELGRPH.C      A hacked file from the Berkeley YACC distribution
    AUTODOC5.TXT    Documentation for my machine generated analysis
    Y.OUTPUT        Machine generated analysis of the C++ grammar.

Aside from the addition of several files, this release of  my  grammar 
corrects  a  few  problems  located  in my prior release.  I have also 
transitioned to using names in my grammar that are more acceptable  to 
a  wider  variety  of  parser  generators.  This release also includes 
support for nested types (at  least  grammatically,  as  there  is  no 
symbol  table  provided).  It does not support templates and exception 
handling, as the ANSI C++ Committee  is  still  discussing  variations 
(and  trying  to  deal  with a variety of ambiguities that the initial 
proposals, such as what is described in the ARM, would entail).

Since my first public release of my grammar, I have received a  number 
of  requests.   One  of  the  most  common  requests was for a lexical 
analyzer to  go  with  the  grammar.   This  release  of  the  grammar 
continues  to  provides  such  a a "bare bones" lexical analyzer.  The 
analyzer does not support preprocessing, or even comment removal.   In 
addition,  since  I  have  not  included  a  symbol table, or semantic 
actions in the grammar  to  maintain  proper  context  (i.e.,  current 
scope),   typedef  names  and  struct/class/union/enum  tags  are  not 
*really* defined.  To  allow  users  to  experiment  with  my  grammar 
without  a  symbol table, my lexer assumes that if the first letter of 
the name is upper case, then then name is a type name.  This  hack  is 
far  from  sufficient  for parsing full blown programs, but it is more 
than sufficient for experimenting with the grammar  to  determine  the 
acceptability  of  a  token sequence, and to understand how my grammar 
parsed the sequence.

Since I did not  believe  that  a  lexical  analyzer  alone  would  be 
sufficient  to assist many people with playing with my grammar, I have 
also provided the basis for a tool to explain what a grammar is doing. 
Specifically, I have modified a file that is included in the  Berkeley 
YACC  distribution  so  that  parsers  generated  by such a YACC would 
automatically display a syntax tree in graphical-ASCII format during a 
parse.  The instructions for using and building  this  yacc  tool  are 
presented  in  the  next  section.  Note that there are no significant 
special hooks in my grammar or parser to excite this  yacc  tool,  and 
the  tool can be used equally well on any grammar that you are working 
with.  This graphical debugging tool  is  probably  one  of  the  most 
popular  aspects  of  my  releases, and its presence and usefulness to 
grammar developers should not be underestimated.

Significantly new to this  release  is  a  large  file  that  contains 
machine  generated  documentation (re: Y.OUTPUT).  This file goes well 
beyond what is provided in a typical verbose output, and provides both 
detailed conflict analysis, and a  number  of  cross-references  which 
make  it  **MUCH**  easier to read the associated grammar.  I have not 
yet decided whether  to  market,  shareware,  or  plain  give-away  my 
program,  so the best I can do at this point is to release the machine 
generated documentation.  Unfortunately, this file  is  *very*  large, 
and  I have decided (for the time being) to distribute it only via the 
ftp sites only.  I am  doing  this  to  lessen  the  global  bandwidth 
utilization  during my grammar posting to the network.  I will however 
post the file (AUTODOC5.TXT)  which  documents  the  contents  of  the 
Y.OUTPUT  file,  so that users can decide if they want to download the 
larger  file.   To  hint  at  what  is  included  in   the   automatic 
documentation, the following are the sections:

	Reference Grammar
	Alphabetized Grammar
	Sample Expansions for the Non-terminal Tokens
	Summary Descriptions of the Terminal Tokens
	Symbol and Grammar Cross Reference for the Tokens
	Sample Stack Context and Accessing Sentences for each State
	Concise list of Conflicts
	Canonical Description of Conflicts
	Verbose listing of state transitions
	Explanations for all reductions suggested in conflicts

Please see AUTODOC5.TXT for more details.

I  have  posted  7  of  the  8 files to comp.lang.c++ (I will not post 
Y.OUTPUT due to its size), to make this information  as  available  as 
possible  to users and developers.  I will also post this introductory 
note to comp.compilers, and comp.lang.c.  I am arranging for  archival 
support  via  several  ftp  sites, and updates will be posted to those 
sites.  I will also try to get the source to Berkeley YACC  posted  to 
these  ftp  sites,  although it is certainly available at more central 
sites.

Currently, Doug Lea and  Doug  Schmidtt  have  graciously  offered  to 
provide  anonymous  ftp  sites  for  all  8  of  files, as well as the 
Berkeley YACC source (if you need it).  The ftp addresses are:

ics.uci.edu (128.195.1.1) in the ftp/pub directory as:

	c++grammar2.0.tar.Z 
	byacc1.8.tar.Z

mach1.npac.syr.edu (128.230.7.14) in the ftp/pub/C++ directory as:

	c++grammar2.0.tar.Z
	byacc1.8.tar.z


HOW TO EXPERIMENT WITH THE C++ GRAMMAR

The following describes how to use the graphical debugging  extensions 
to Berkeley YACC to explore the grammar.

Note that the following instructions assume that you have the Berkeley 
YACC  source  on  hand.   You can actually use any YACC to process the 
grammar, but  running  the  resulting  demo  (which  has  no  semantic 
actions)  will  tend to be quite boring.  If you can't get hold of the 
Berkeley YACC, you should  at  least  try  to  enable  the  "debugging 
options"  in  your  parser  to  so  that  you can see in some way what 
reductions are taking place.  (Hint: search for the letters "debug" in 
the C file that your yacc produces...).

        1) Get the entire source for Berkeley YACC 1.8 1/2/91
        2) Verify that you can make the YACC
        3) Rename SKELETON.C to SKELOLD.C, and implant my SKELGRPH.C
                in that directory as SKELETON.C
        4) Make the yacc using this new SKELETON.C
        5) Using the above yacc, process my CPP5.Y file
                yacc -dvl cpp5.y
           The result should be a file y.tab.c, and y.tab.h
        6) Using Flex (replacement for lex) to process my CPP5.L file
                flex cpp5.l
           the result should be yy.lex.c
        7) Compile the two files
                cc -o cpp5  y.tab.c yy.lex.c
           the result should be an executable called cpp5
        8) Set the environment variable YYDEBUG to 6
                setenv YYDEBUG 6
           If you don't do this, the graphical output will not appear!
        9) Run the program cpp5
                cpp5
        10) Try the input:
                int a;
        11) You should see a nice parse tree.  Enjoy.  Note that
            the lexer DOES NOT INCLUDE A SYMBOL TABLE, and does
            NOT KEEP TRACK OF CURRENT SCOPES.  The hack (see the
            CPP5.L file for details) is to assume that any identifier
            that begins with a capital letter is a typedef name
            Send complaints about code that doesn't parse "correctly".


HISTORICAL NOTES


Developing the C grammar (that is intended to be compatible  with  the 
ANSI  C standard) was relatively straight forward (compared to the C++ 
grammar).  The one difficulty in this process was the desire to  avoid 
use  of  %prec  and  %assoc  constructs  in  YACC, which would tend to 
obscure ambiguities.  Since I didn't know what ambiguities were  lying 
in  wait  in  C++,  obscuring  ambiguities  was unacceptable.  It took 
several weeks to remove the conflicts that typically appear,  and  the 
tedious  process  exposed  several  ambiguities that are not currently 
disambiguated by the ANSI standard.  The quality of the C  grammar  is 
(IMHO)  dramatically  higher  than what has been made available within 
the  public  domain.   Specifically,  a   C   grammar's   support   of 
redefinition  of typedef names within inner scopes (the most difficult 
area of the grammar) is typically excluded from public domain grammar, 
and even excluded from most grammars that  are  supplied  commercially 
with  parser  generators!   I  expect  that  this grammar will be very 
useful in the development of C related tools.

The development of the C++ grammar (initially compatible with  version 
1.2,  but  enhanced to support version 2.0 specifications as they were 
made available) was anything but straight  forward.   The  requirement 
that  I  set  to NOT USE %prec and %assoc proved both a blessing and a 
curse.  The blessing was that I could see what the  problems  were  in 
the  language, the curse was that there were A LOT of conflicts (I can 
recall  times  during  the  development  effort  when  the  number  of 
conflicts  was  well  in  excess of 200).  The most recent addition of 
nested types probably took about 2 weeks to implement.  On  the  other 
hand,  I  probably  spent  several  months  developing  the  automated 
documentation tools that allowed me to  debug  the  grammar  additions 
this quickly. 

Towards  the  end of the initial development of the C++ grammar, which 
took roughly 3 months of my time (circa summer 1989), I began  to  see 
the  folly in part of my quest.  I came to the conclusion that further 
attempts  to  modify  my  grammar,  so  as  to  defer  resolutions  of 
ambiguities,  would  lead  to an unreadable language. Specifically, my 
feeling was that I was  entering  into  a  battle  of  wits  with  the 
compiler,  and  the compiler was starting to win.  It was beginning to 
be the case that the parser COULD  figure  out  what  I  said,  but  I 
couldn't.  Indeed, even examples in a version of the C++ 2.0 reference 
manual (and published in the ARM) demonstrated this problem (my parser 
could  parse  some  examples  that  neither  I  nor the authors parsed 
correctly!).  At this point I decided to  stop  my  quest  to  FURTHER 
defer  resolutions  of  ambiguities, and let the grammar commit in one 
direction (always in favor of declarations), at the late point that is 
provided by my grammar.  If this direction proved "incorrect in  light 
of  the  context  that  followed", then I generated a syntax error.  I 
believe this strategy provides  ample  room  for  expressiveness.   In 
support  of  this expressiveness, I have (based on my discussions with 
language  experts)  deferred  disambiguation  far  longer  than  other 
attempts  at  producing an LR(1) grammar.  I would strongly argue that 
any code that my grammar identifies as having a "syntax error"  (based 
on  "premature"  disambiguation), but cfront allows, should ABSOLUTELY 
be rewritten in a less ambiguous (and hence more portable) form.

It should be noted that my grammar cannot  be  in  constant  agreement 
with   such  implementations  as  cfront  because  a)  my  grammar  is 
internally consistent (mostly courtesy of its formal nature  and  YACC 
verification),  and b) YACC generated parsers don't dump core. (I will 
probably take a lot of flack for that last snipe, but.... every time I 
have had difficulty figuring what  was  meant  syntactically  by  some 
construct that the ARM was vague about, and I fed it to cfront, cfront 
dumped core.)

One major motivation for using the C++ grammar that I have provided is 
that it is capable of supporting old style function definitions (e.g.: 
main(argc,  argv)  int  argc;  char*argv[];  {...}  ).  I believe this 
capability was removed from the C++ specification in order  to  reduce 
ambiguities  in  a  specific  implementation  (cfront).  As my grammar 
demonstrates, this action was  not  necessary.  Supporting  old  style 
function  definition  GREATLY  eases  the transition to the use of C++ 
from traditional C.  I expect that as some parsers  begin  to  support 
this  option,  that  other  parsing  engines  will  be  forced in this 
direction by a competitive marketplace.  Using  my  grammar,  and  the 
standards it implies, appears to be a very straightforward approach to 
this support.

A  second  motivation for using my grammar is that it can be processed 
by YACC.  The advantage in this fact lies with  YACC's  capability  to 
identify  ambiguities.   For  software  manufacturers that are heavily 
concerned with correctness,  this  is  an  INCREDIBLE  advantage.   My 
experience  with  hand  written  parsers  (which  usually  represent a 
translation by a fallible human from a grammar  to  parsing  code)  is 
that  they  evolve and become more correct with time.  Ambiguous cases 
are often misparsed, without the author ever  realizing  there  was  a 
conflict!   In  contrast,  either  a  YACC  grammar  supports  a given 
construct, or it doesn't.  If a YACC grammar supports a construct, the 
semantic interpretation  is  usually  rather  straight  forward.   The 
likelihood of internal errors in the parser is therefore SIGNIFICANTLY 
reduced.  The  fact the the grammars I supplied are free of %assoc and 
%prec operators, implies the grammar  are  fairly  portable,  and  the 
conflicts are open to peer code review (and not obscured).

Most  recently  I have joined the ANSI C++ committee (X3J16), and have 
tried to follow their progress with hopes of maintaining compliance in 
my grammar.  Unfortunately, political pressures within X3J16 appear to 
make it IMHO more desirable to quickly approve a standard that matches 
cfront's performance (when it is not dumping core), than to provide  a 
clean,  consistent  and formal syntax as part of the standard.  Rather 
than fixing inconsistent hackery within the syntax, there  is  IMHO  a 
tendency   to  want  to  "hack  further"  to  match  cfront's  current 
performance (or the ARM's prophesy).   As  an  example  of  this,  the 
fundamental  hack  in  all of C is the feedback from the parser to the 
lexer to identify typedefnames.  There is discussion afoot to (for  no 
reason  other than compliance with a *proposed* cfront feature) extend 
this another notch and require feedback to distinguish template names. 
This hackery was not required by the syntax, rather it was "desirable" 
to match the performance of beta-cfront (and the  ARM).   When  cfront 
changes,  and  old  code is obsoleted, the arguments abound that it is 
for the good of humanity.  When cfront is hacking inconsistently, then 
no change can be made, because of the thousands of lines of code  that 
depend  on this psuedo-standard.  Perhaps my grammar will help in some 
small  way  Microsoft,  Zortech,  Borland,   and   dozens   of   other 
entrepreneurs  work toward building a standard for a language that has 
enough consistency to grow and flourish (note that none of  the  above 
vendors  use  my grammar in their products, but I think they would all 
share my desire for a cleaner syntax).  If I  am  successful  with  my 
grammar,  then  I  will be able to write C++ tools in a consistent and 
open marketplace.  From my perspective, the outcome is not clear.   If 
you have a channel to support the use of a cleaner syntax in the X3J16 
standard, I would heartily invite you to exercise that channel.

As  it  currently  stands,  my  grammar  teeters  on the edge of being 
unusable due to its size.  The size in turn is due to the  variety  of 
special cases that must be dealt with within C++ parsing.  With only a 
few more inconsistent additions to the "standard language", my grammar 
will surely become completely unusable.  I am trying to develop a yacc 
preprocessor  that will allow me to rein back in the complexity of the 
grammar.  If I can do this, then it will continue to  be  possible  to 
update  my  grammar  to  match  the emerging ANSI Standard. I can only 
promise to try.


FEEDBACK ABOUT THE GRAMMARS

If you find any errors in my grammars, I would be  DELIGHTED  to  hear 
mention  of  them!!!!   These  should  fall  into one of the following 
categories:

        1) The grammar left out the following features of C++...
        or
        2) The grammar mis-parses the following sequences...
        or
        3) The discussion of the following ambiguity is
        incorrect...
        4) The grammar could be simplified as follows...

Please send  correspondence  of  this  form  to  jar@hq.ileaf.com.  My 
response  to  1's  will be to add the feature (if possible!); feel sad 
that I made a mistake; and feel glad that YOU found it.  I will have a 
similar response to 2's.  Responses of type 3 are GREAT, but I haven't 
found many folks that really get into YACC ambiguities, so I have  low 
expectations...  feel free to surprise me!!! :-) :-).  Items of type 3 
are interesting, but since simplicity is in the eye of  the  beholder, 
such  suggestions  are  subject  to  debate.  I would be interested in 
seeing suggestions in this area with the constraint that they  do  not 
increase  the  number of conflicts in the grammar!  Please use YACC to 
check your suggestions before submitting them. (It  is  often  AMAZING 
how  the  slightest  change  in  the  grammar can change the number of 
conflicts, and it took a great deal  of  work  to  reach  the  current 
state).  Distribution site(s) will be set up to distribute updates and 
or corrections.  Postings about the presence of  corrections  will  be 
made on the net.

Since  the  two  grammars  (C and C++) were generated in parallel, you 
should be able to compare non-terminals directly.  This will hopefully 
make it easier to identify the  complexities  involved  with  the  C++ 
grammar, and "ignore" those that result from standard ANSI C.  In both 
cases  I  have  left the standard if-if-else ambiguity intact.  In the 
case of ANSI C grammar, this is the only shift-reduce conflict in  the 
grammar.  Although there are a number of conflicts in the C++ grammar, 
there  are  actually  very  few  classes  of  problems.  In  order  to 
disambiguate the C++ grammar enough that YACC can figure out  what  to 
do,  I  was  commonly forced to "inline expand" non-terminals found in 
the C grammar.  This expansion allowed YACC  to  defer  disambiguation 
until  it  was possible for an LR(1) parser to understand the context. 
The unfortunate consequence of this inline expansion is a large growth 
in the number of rules, and the presence of an effective  "multiplier" 
in  most  cases  where  conflicts do occur. As a result, any conflicts 
that arise are multiplied by a factor corresponding to the  number  of 
rules  I  had  to  list  separately.   I  have grouped the C++ grammar 
conflicts in the "Status" section of the GRAMMAR5.TXT paper,  but  you 
are welcome to explore my grammars using YACC directly (be warned that 
you  will  need  a  robust  version  of  YACC  to handle the C++ sized 
grammar).  PLEASE do not be put off by the number of conflicts in  the 
C++  grammar.  There are VERY FEW CONFLICTS, but my elaborated grammar 
confuses the count.

The GRAMMAR5.TXT paper is FAR from a publishable quality paper, but it 
discusses many of the issues involved in ambiguities  in  my  grammar, 
and  more  generally  in the C++ language. I hope GRAMMAR5.TXT it is a 
vast improvement over "nothing at all", but apologize in  advance  for 
lack of polished structure, and the presence of many typos (which must 
surely be present).  I hope you find this almost-paper interesting. My 
attempts   at  documenting  conflicts  have  certainly  clarified  the 
problems in my mind. Based on my experience with the conflicts I  have 
identified,  most  current  compilers  and translator fall prey to the 
situations that I have uncovered.  I hope that other  compilers,  that 
do  not  make  use of the grammar I have made available, will at least 
seek to standardize the resolution of the problems identified.


The AUTODOC5.TXT file provides interesting reading  for  both  readers 
interested  in  LR  and  LALR  parsing (and the subtle connections and 
distinctions between them), as well as any user that wishes  to  fully 
comprehend  the  contents  of  the  Y.OUTPUT  file.   It  includes  an 
extensive  discussion  of  ambiguities,  how  they  are  removed,  how 
LALR-only ambiguities arise, and how they can be dealt with.

With  this  release  of the grammar I have begun to distribute machine 
generated documentation for my grammar.  As a result, if  my  analysis 
of  conflicts  are  questionable,  the  supporting data is immediately 
present to confirm or deny my analysis.  If you wish to correct any of 
my analysis, please use and refer to the Y.OUTPUT  file  that  I  have 
provided. 

As  a  small  commercial message... I am a freelance consultant, and I 
travel far and wide to perform contracts.  If you like the work that I 
am presenting in this group of documents, and  would  like  to  see  a 
resume or at least talk, please feel free to contact me.

I  hope  that  the  grammars  that  I have provided, will lead to many 
successful C++ processing projects.

Jim Roskind 
Independent Consultant 
516 Latania Palm Drive  
Indialantic FL 32903 
(407)729-4348 
jar@hq.ileaf.com or ...!uunet!leafusa!jar